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NONPARAMETRIC BAYESIAN SPARSE FACTOR MODELS WITH 
APPLICATION TO GENE EXPRESSION MODELING 

By David Knowles 1 and Zoubin Ghahramani 2 

University of Cambridge 

A nonparametric Bayesian extension of Factor Analysis (FA) is 
proposed where observed data Y is modeled as a linear superposi- 
tion, G, of a potentially infinite number of hidden factors, X. The 
Indian Buffet Process (IBP) is used as a prior on G to incorporate 
sparsity and to allow the number of latent features to be inferred. The 
model's utility for modeling gene expression data is investigated using 
randomly generated data sets based on a known sparse connectivity 
matrix for E. Coli, and on three biological data sets of increasing 
complexity. 

1. Introduction. Principal Components Analysis (PCA), Factor Anal- 
ysis (FA) and Independent Components Analysis (ICA) are models which 
explain observed data, y n € MP, in terms of a linear superposition of inde- 
pendent hidden factors, x n G M K , so 

(1) y n = Gx n + e n , 

where G is the factor loading matrix and e n is a noise vector, usually as- 
sumed to be Gaussian. These algorithms can be expressed in terms of per- 
forming inference in appropriate probabilistic models. The latent factors are 
usually considered as random variables, and the mixing matrix as a parame- 
ter to estimate. In both PCA and FA the latent factors are given a standard 
(zero mean, unit variance) normal prior. In PCA the noise is assumed to 
be isotropic, whereas in FA the noise covariance is only constrained to be 
diagonal. A standard approach in these models is to integrate out the latent 
factors and find the maximum likelihood estimate of the mixing matrix. In 
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ICA the latent factors are assumed to be heavy-tailed, so it is not usually 
possible to integrate them out. In this paper we take a fully Bayesian ap- 
proach, viewing not only the hidden factors but also the mixing coefficients 
as random variables whose posterior distribution given data we aim to infer. 

Sparsity plays an important role in latent feature models, and is desir- 
able for several reasons. It gives improved predictive performance, because 
factors irrelevant to a particular dimension are not included. Sparse models 
are more readily interpretable since a smaller number of factors are asso- 
ciated with observed dimensions. In many real world situations there is an 
intuitive reason why we expect sparsity: for example, in gene regulatory net- 
works a transcription factor will only regulate genes with specific motifs. In 
our previous work [Knowles and Ghahramani (2007)] we investigated the 
use of sparsity on the latent factors x n , but this formulation is not appro- 
priate in the case of modeling gene expression, where, as described above, 
a transcription factor will regulate only a small set of genes, corresponding 
to sparsity in the factor loadings, G. Here we propose a novel approach 
to sparse latent factor modeling where we place sparse priors on the factor 
loading matrix, G. The Bayesian Factor Regression Model of West et al. 
(2007) is closely related to our work in this way, although the hierarchical 
sparsity prior they use is somewhat different. An alternative "soft" approach 
to incorporating sparsity is to put a Gamma(a,6) (usually exponential, i.e., 
a = 1) prior on the precision of each element of G independently, result- 
ing in the elements of G being marginally Student-t distributed a priori; 
see Fokoue (2004), Fevotte and Godsill (2006), and Archambeau and Bach 
(2009). A LASSO-based approach to generating a sparse factor loading has 
also been developed [Zou, Hastie and Tibshirani (2004); Witten, Tibshirani 
and Hastie (2009)]. We compare these sparsity schemes empirically in the 
context of gene expression modeling. 

A problematic issue with this type of model is how to choose the latent 
dimensionality of the factor space, K. Model selection can be used to choose 
between different values of K, but generally requires significant manual in- 
put and still requires the range of K over which to search to be specified. 
Zhang et al. (2004) applied Reversible Jump MCMC to PCA, which has 
many of the advantages of our approach: a posterior distribution over the 
number of latent dimensions can be approximated, and the number of la- 
tent dimensions could potentially be unbounded. However, RJ MCMC is 
considerably more complex to implement for sparse Factor Analysis than 
our proposed framework. 

We use the Indian Buffet Process [Griffiths and Ghahramani (2006)], 
which defines a distribution over infinite binary matrices, to provide sparsity 
and a framework for inferring the appropriate latent dimension of the data 
set using a straightforward Gibbs sampling algorithm. The Indian Buffet 
Process (IBP) allows a potentially unbounded number of latent factors, so 
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we do not have to specify a maximum number of latent dimensions a priori. 
We denote our model "NSFA" for "Nonparametric Sparse Factor Analysis." 
Our model is closely related to that of Rai and Daume III (2008), and is 
a simultaneous development. 

2. The model. We will define our model in terms of equation (1). Let Z 
be a binary matrix whose (d, k)th element represents whether observed di- 
mension d includes any contribution from factor k. We then model the mix- 
ing matrix by 



where X k is the inverse variance (precision) of the kth factor and 5q is a delta 
function (point-mass) at 0. Distributions of this type are sometimes known 
as "spike and slab" distributions. We allow a potentially infinite number of 
hidden sources, so that Z has infinitely many columns, although only a finite 
number will have nonzero entries. This construction allows us to use the IBP 
to provide sparsity and define a generative process for the number of latent 
factors. 

We assume independent Gaussian noise, s n , with diagonal covariance ma- 
trix We find that for many applications assuming isotropic noise is too 
restrictive, but this option is available for situations where there is strong 
prior belief that all observed dimensions should have the same noise vari- 
ance. The latent factors, x n , are given Gaussian priors. Figure 1 shows the 
complete graphical model. 

2.1. Defining a distribution over infinite binary matrices. We now define 
our infinite model by taking the limit of a series of finite models. 



(2) 



p{gdk\ZdkAk) = Z dk Af(g dk ;0,\ k ) + (1 - Z dk )5 (g dk ) 




Fig. 1. Graphical model. 
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Start with a finite model. We derive the distribution on Z by defining 
a finite K model and taking the limit as K — > oo. We then show how the 
infinite case corresponds to a simple stochastic process. 

We have D dimensions and K hidden sources. Recall that z^k of matrix Z 
tells us whether hidden source k contributes to dimension d. We assume that 
the probability of a source k contributing to any dimension is ir^, and that 
the rows are generated independently. We find 

K D K 

(3) p(zitt) = n n p M*k) = n (i - ^) D - mk , 

jfc=i<j=i k=i 

where = Yld=i z dk is the number of dimensions to which source k con- 
tributes. The inner term of the product is a binomial distribution, so we 
choose the conjugate Beta(r, s) distribution for ir^. For now we take r = j| 
and s = 1, where a is the strength parameter of the IBP. The model is 
defined by 

(4) ^ fc |a~Beta(— ,1 

(5) z d k\^k ~ Bernoulli (7r fc ). 

Due to the conjugacy between the binomial and beta distributions, we are 
able to integrate out it to find 

^ pm TT (a/K)T(m k + a/K)T(D -m k + l) 

(6) P(Z) = n T(D + l + a/K) ' 

where T(-) is the Gamma function. 

Take the infinite limit. Griffiths and Ghahramani (2006) define a scheme 
to order the nonzero rows of Z which allows us to take the limit K — > oo 
and find 

(7) P|Z, = ICW , "* I !J m • 

where K + is the number of active features (i.e., nonzero columns of Z), 
H]j = Ylj=i J i s ^ e Dth harmonic number, and is the number of rows 
whose entries correspond to the binary number h. 

Go to an Indian Buffet. This distribution corresponds to a simple stochas- 
tic process, the Indian Buffet Process. Consider a buffet with a seemingly 
infinite number of dishes (hidden sources) arranged in a line. The first cus- 
tomer (observed dimension) starts at the left and samples Poisson(a) dishes. 
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The ith customer moves from left to right sampling dishes with probability 
^ where m k is the number of customers to have previously sampled dish k. 
Having reached the end of the previously sampled dishes, he tries Poisson( °) 
new dishes. Figure 2 shows two draws from the IBP for two different values 
of a. 

If we apply the same ordering scheme to the matrix generated by this 
process as for the finite model, we recover the correct exchangeable distri- 
bution. Since the distribution is exchangeable with respect to the customers, 
we find by considering the last customer that 

(8) P(z kt = l\z_ kn ) = ^^, 

where m k -t = ^ s ^^ s , which is used in sampling Z. By exchangeability 
and considering the first customer, the number of active sources for each 
dimension follows a Poisson(a) distribution, and the expected number of 
entries in Z is Da. We also see that the number of active features K + = 
J2d=i Poisson(^) = Poisson [aHo)- 

3. Related work. The Bayesian Factor Regression Model (BFRM) of 
West et al. (2007) is closely related to the finite version of our model. The 
key difference is the use of a hierarchical sparsity prior. Each element of G 
has prior of the form 

9dk ~ (1 ~ n dk )5o(gdk) +KdkN(gdk]0A k 1 )- 
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The finite IBP model is equivalent to setting ir^k = it k ~ Beta(a/-fT, 1) and 
then integrating out ir k . In BFRM a hierarchical prior is used: 

Tfdfc ~ (1 -pk)5o(ndk) +PfeBeta(7r dfc ; am, o(l -m)), 

where ~ Beta(sr, s(l — r)). Nonzero elements of ir^k are given a diffuse 
prior favoring larger probabilities [a = 10, m = 0.75 are suggested in West 
et al. (2007)], and p k is given a prior which strongly favors small values, 
corresponding to a sparse solution (e.g., s = D, r = -^). 
Note that on integrating out 7r^, the prior on g dk is 

9dk ~ (1 - rnp k )5 (g dk ) + mp k M{g dk ;0, A^ 1 ). 

This hierarchical sparsity prior is motivated by improved interpretability 
in terms of less uncertainty in the posterior as to whether an element of G 
is nonzero. However, this comes at a cost of significantly increased compu- 
tation and reduced predictive performance, suggesting that the uncertainty 
removed from the posterior was actually important. 

The LASSO-based Sparse PCA (SPCA) method of Zou, Hastie and Tib- 
shirani (2004) and Witten, Tibshirani and Hastie (2009) has similar aims to 
our work in terms of providing a sparse variant of PCA to aid interpreta- 
tion of the results. However, since SPCA is not formulated as a generative 
model, it is not necessarily clear how to choose the regularization parameters 
or dimensionality without resorting to cross-validation. In our experimental 
comparison to SPCA, we adjust the regularization constants such that each 
component explains roughly the same proportion of the total variance as 
the corresponding standard (nonsparse) principal component. 



4. Inference. Given the observed data Y, we wish to infer the hidden 
sources X, which sources are active Z, the mixing matrix G, and all hyperpa- 
rameters. We use Gibbs sampling, but with Metropolis-Hastings (MH) steps 
for sampling new features. We draw samples from the marginal distribution 
of the model parameters given the data by successively sampling the condi- 
tional distributions of each parameter in turn, given all other parameters. 

Since we assume independent Gaussian noise, the likelihood function is 



(9) 



-^(y n -Gx„) T ^ 1 (y re -Gx n ) 



where V is a diagonal noise covariance matrix. 
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Notation. We use — to denote the "rest" of the model, that is, the values 
of all variables not explicitly conditioned upon in the current state of the 
Markov chain. The rth row and cth column of matrix A are denoted A r: 
and A-c respectively. 

Mixture coefficients. We first derive a Gibbs sampling step for an in- 
dividual element of the IBP matrix, Z dk , determining whether factor k is 
active for dimension d. Recall that \ k is the precision (inverse covariance) 
of the factor loadings for the &th factor. The ratio of likelihoods can be cal- 
culated using equation (9). Integrating out the (d, k)th element of the factor 
loading matrix g dk [whose prior is given by equation (2)], we obtain 

( 10 ) P(Y\Zdk = l,-) = J P(Y\g dk ,-)M(g dk ;Q,X^)dg dk 
[ ' P(Y\Z dk = 0,-) P(Y\g dk = 0,-) 

(11) = V / T exp G V )' 

where we have defined A = ip^ 1 X^.X k . 4- \ k and fi = ^j—X^.E d . with the 

matrix of residuals E = Y — GX evaluated with g dk = 0. The dominant 
calculation is that for f/, since the calculation for A can be cached. This 
operation is 0(N) and must be calculated D x K times, so sampling the 
IBP matrix, Z, and factor loading matrix, G, is order O(NDK). 

From the exchangeability of the IBP, we can imagine that dimension d 
was the last to be observed, so that the ratio of the priors is 

(u) P(Z dk = l\-) = m_ dtk 

K ' P{Z dk = 0\-) N-l-m- d y 

where m_ d k is the number of dimensions for which factor k is active, ex- 
cluding the current dimension d. Multiplying equations (11) and (12) gives 
the expression for the ratio of posterior probabilities for Z dk being 1 or 0, 
which is used for sampling. If Z dk is set to 1, we sample g dk \— ~A/"(/i,A _1 ) 
with fi, A defined as for equation (11). 

Adding new features. Z is a matrix with infinitely many columns, but 
the nonzero columns contribute to the likelihood and are held in memory. 
However, the zero columns still need to be taken into account since the 
number of active factors can change. Let K d be the number of columns of Z 
which contain 1 only in row d, that is, the number of features which are 
active only for dimension d. Note that due to the form of the prior for 
elements of Z given in equation (12), K d = for all d after a sampling sweep 
of Z. Figure 3 illustrates K d for a sample Z matrix. 

New features are proposed by sampling K d with a MH step. It is possible 
to integrate out either the new elements of the mixing matrix, g (a 1 x K d 
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factors (dishes) 

Fig. 3. A diagram to illustrate the definition of Kd, for d — 10. 



vector), or the new rows of the latent feature matrix, X' (a n d x N matrix), 
but not both. Since the latter generally has higher dimension, we choose to 
integrate out X' and include g T as part of the proposal. Thus, the proposal 
is £ = {n d ,g}, and we propose a move £ — > £* with probability J(£*|£). In 
this case £ = since, as noted above, K d = initially. The simplest proposal, 
following Meeds et al. (2006), would be to use the prior on £*, that is, 

J(0 = P(K d \a) -p(g\K d ,X k ) = Poisson^; 7) ■ N(g; 0, A^T 1 ), 

where 7 = . 

Unfortunately, the rate constant of the Poisson prior tends to be so small 
that new features are very rarely proposed, resulting in slow mixing. To 
remedy this, we modify the proposal distribution for K d and introduce two 
tunable parameters, 7r and A: 

(13) J(K d ) = (1 - 7r) Poisson (K d ; A7) + irl(K d = 1). 

Thus, the Poisson rate is multiplied by a factor A, and a spike at K d = 1 is 
added with mass 7r. The proposal is accepted with probability min(l,af_>£«) 
where 



(14) 



P(C\-,Y)J(£\C) 

P(Z\-Y)J{?\t) 

P(Y\Z*,-)P(K d \a)p(g\ Kd ,\ k ) _ 
P(Y\-)J(K d )p(g\K d ,X k ) ai 
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(15) a,- P ^- 



P(Y\-) 
P{Kd\a) Poisson^; 7) 



^ 16 ^ ^ J(K d ) Poisson(K d ; A7) ' 

Note that we take </(£!£*) = 1 since £ = 0. To calculate the likelihood ra- 
tio, ai, we need the collapsed likelihood under the new proposal: 

N 

(17) P(Y d ..\e,-) = II / P(Y dn \e,<,-)P«)dx.' 

n=l J 

N 

= n(2vr^ 1 )" 1/2 (2^)^/ 2 |Mr 1 / 2 

(18) 

x expf -(m^Mm n - ^E 2 dni 

where we have defined M = V ; ,7 1 gg T + In d and m n = M~ 1 ^J 1 g^ n with the 
matrix of residuals E = Y — GX. The likelihood under the current sample is 

(19) P(y d ^,-) = II(2vr^- 1 )- 1 / 2 exp(-I^ 1 ^i). 

Substituting these likelihood terms into the expression for the ratio of like- 
lihood terms, ai, gives 

(20) ai = (2k) n ^ |M|-^ 2 exp (\ £ m^Mm n ) . 

We found that appropriate scheduling of the sampler improved mixing, 
particularly with respect to adding new features. The final scheme we settled 
on is described in Algorithm 1. 

IBP parameters. We can choose to sample the IBP strength parame- 
ter a, with conjugate Gamma(e,/) prior (note that we use the inverse scale 
parameterization of the Gamma distribution) . The conditional prior of equa- 
tion (7) acts as the likelihood term and the posterior update is as follows: 

(21) P(a\Z) oc P(Z\a)P(a) = Gamma(a; K + + e,f + H D ), 

where K + is the number of active sources and Hp = Y2f=i J * s the Dth 
harmonic number. 

The remaining sampling steps are standard, but are included here for 
completeness. 
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Algorithm 1 One iteration of the NSFA sampler 
for d = 1 to D do 
for k = 1 to K do 

Sample Z^k 
end for 
Sample /c<j 
end for 

for re = 1 to N do 

Sample X n 
end for 

Sample a, 0, X g 



Latent variables. Sampling the columns x n of the latent variable ma- 
trix X for each t £ [1, . . . , N], we have 

(22) P(x„|-) ocP(y„|x„,-)P(x„) =jV(x n ;/i n , A), 

where we have defined A = G r i/>~ 1 G + / and /x n = A -1 G T ^> y n . Note that 
since A does not depend on re, we only need to compute and invert it once 
per iteration. Calculating A is order 0(K 2 D), and inverting it is 0(K 3 ). 
Calculating fi t is order O(KD) and must be calculated for all N Xf's, a total 
of O(NKD). Thus, sampling X is order 0(K 2 + K 3 + NKD). 

Factor precision. If the mixture coefficient prior precisions Xj~ are con- 
strained to be equal, we have = A ~ Gamma(c, d). The posterior update 
is then given by \\G ~ Gamma(c + ^ k ™ k , d + ^ d k G 2 dk ). 

However, if the variances are allowed to be different for each column 
of G, we set A^ ~ Gamma(c, d), and the posterior update is given by A^|G ~ 
Gamma(c+^,d + £ d G^ fc ). In this case we may also wish to share power 
across factors, in which case we also sample d. Putting a Gamma prior on d 
such that d ~ Gamma(co, do), the posterior update is d\\k ~ Gamma(co + 
cK,d + Ef=i A fc)- 

Noise variance. The additive Gaussian noise can be constrained to be 
isotropic, in which case the inverse variance is given a Gamma prior: if)^ 1 = 
tp^ 1 ~ Gamma(a,6) which gives the posterior update ~ Gamma(a + 

However, if the noise is only assumed to be independent (which we have 
found to be more appropriate for gene expression data), then each dimen- 
sion has a separate noise variance, whose inverse is given a Gamma prior: 
vp^ 1 ~ Gamma(a, b) which gives the posterior update ip~^ 1 | — ~ Gamma(a + 
Y , b + E% n ) where the matrix of residuals E = Y — GX. We can share 
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power between dimensions by giving the hyperparameter b a hyperprior 
Gamma(ao, &o) resulting in the Gibbs update b\— ~ Gamma(ao + aD,bo + 
Y2d=l ^d, 1 )- This hierarchical prior results in soft coupling between the noise 
variances in each dimension, so we will refer to this variant as sc. 

5. Results. We compare the following models: 

• FA — Bayesian Factor Analysis; see, for example, Kaufman and Press (1973) 
or Rowe and Press (1998). 

• AFA — Factor Analysis with ARD prior to determine active sources. 

• FOK — The sparse Factor Analysis method of Fokoue (2004), Fevotte and 
Godsill (2006) and Archambeau and Bach (2009). 

• SPCA— The Sparse PCA method of Zou, Hastie and Tibshirani (2004). 

• BFRM — Bayesian Factor Regression Model of West et al. (2007). 

• SFA — Sparse Factor Analysis, using the finite IBP. 

• NSFA — The proposed model: Nonparametric Sparse Factor Analysis. 

Note that all of these models can be learned using the software package 
we provide simply by using appropriate settings. 

5.1. Synthetic data. Since generating a connectivity matrix Z from the 
IBP itself would clearly bias toward our model, we instead use the D = 100 
gene by K = 16 factor E. Coli connectivity matrix derived in Kao et al. 
(2004) from RegulonDB and current literature. We ignore whether the con- 
nection is believed to be up or down regulation, resulting in a binary ma- 
trix Z. We generate random data sets with iV = 100 samples by drawing 
the nonzero elements of G (corresponding to the elements of Z which are 
nonzero), and all elements of X, from a zero mean unit variance Gaussian, 
calculating Y = GX + E, where E is Gaussian white noise with variance set 
to give a signal to noise ratio of 10. 

Here we will define the reconstruction error E r as 

where 67, K are the inferred quantities. Although we minimize over permu- 
tations, we do not minimize over rotations since, as noted in Fokoue (2004), 
the sparsity of the prior stops the solution being rotation invariant. We av- 
erage this error over the last ten samples of the MCMC run. This error 
function does not penalize inferring extra spurious factors, so we will inves- 
tigate this possibility separately. The precision and recall of active elements 
of the Z achieved by each algorithm (after thresholding for the nonsparse 
algorithms) are presented in the Supplementary Material, but omitted here 
since the results are consistent with the reconstruction error. 
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Fig. 4. Boxplot of reconstruction errors for simulated data derived from the E. Coli con- 
nectivity matrix of Kao et al. (2004)- Ten data sets were generated and the reconstruction 
error calculated for the last ten samples from each algorithm. Numbers refer to the number 
of latent factors used, K . al denotes fixing a = 1. sn denotes sharing power between noise 
dimensions. 



The reconstruction error for each method with different numbers of latent 
features is shown in Figure 4. Ten random data sets were used and for the 
sampling methods (all but SPCA) the results were averaged over the last 
ten samples out of 1000. Unsurprisingly, plain Factor Analysis (FA) performs 
the worst, with increasing overfitting as the number of factors is increased. 
For K = 20 the variance is also very high, since the four spurious features fit 
noise. Using an ARD prior on the features (AFA) improves the performance, 
and overfitting no longer occurs. The reconstruction error is actually less for 
K = 20, but this is an artifact due to the reconstruction error not penalizing 
additional spurious features in the inferred G. The Sparse PCA (SPCA) of 
Zou, Hastie and Tibshirani (2004) shows improved reconstruction compared 
to the nonsparse methods (FA and AFA) , but does not perform as well as the 
Bayesian sparse models. Sparse factor analysis (SFA), the finite version of 
the full infinite model, performs very well. The Bayesian Factor Regression 
Model (BFRM) performs significantly better than the ARD factor analysis 
(AFA), but not as well as our sparse model (SFA). It is interesting that for 
BFRM the reconstruction error decreases significantly with increasing K, 
suggesting that the default priors may actually encourage too much sparsity 
for this data set. Fokoue's method (FOK) only performs marginally better 
than AFA, suggesting that this "soft" sparsity scheme is not as effective at 
finding the underlying sparsity in the data. Overfitting is also seen, with 
the error increasing with K. This could potentially be resolved by placing 
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Fig. 5. Histograms of the number of latent features inferred by the nonparametric sparse 
FA sampler for the last 100 samples out of 1000. Left: With a = l. Right: Inferring a. 

an appropriate per factor ARD-like prior over the scale parameters of the 
Gamma distributions controlling the precision of elements of G. Finally, the 
Nonparametric Sparse Factor Analysis (NSFA) proposed here and in Rai and 
Daume III (2008) performs very well. With fixed a = 1 (al) or inferring a, 
we see very similar performance. Using the soft coupling (sc) variant which 
shares power between dimensions when fitting the noise variances seems to 
reduce the variance of the sampler, which is reasonable in this example since 
the noise was in fact isotropic. 

Since the reconstruction error does not penalize spurious factors, it is im- 
portant to check that NSFA is not scoring well simply by inferring many 
additional factors. Histograms for the number of latent features inferred for 
the nonparametric sparse model are shown in Figure 5. This represents an 
approximate posterior over K. For fixed a = 1 the distribution is centered 
around the true value of K = 16, with minimal bias (EK = 16.1). The vari- 
ance is significant (standard deviation of 1.46), but is reasonable considering 
the noise level (SNR = 10) and that in some of the random data sets, ele- 
ments of Z which are 1 could be masked by very small corresponding values 
of G. This hypothesis is supported by the results of a similar experiment 
where G was set equal to Z. In this case, the sampler always converged to 
at least 16 features, but would also sometimes infer spurious features from 
noise (results not shown). When inferring a some bias and skew are notice- 
able. The mean of the posterior is now at 18.3 with standard deviation 2.0, 
suggesting there is little to gain from sampling a in this data. 

5.2. Convergence. NSFA can suffer from slow convergence if the number 
of new features is drawn from the prior. Figure 6 shows how the different 
proposals for k<i effect how quickly the sampler reaches a sensible number 
of features. If we use the prior as the proposal distribution, mixing is very 
slow, taking around 5000 iterations to converge, as shown in Figure 6(a). If 
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Fig. 6. The effect of different proposal distributions for the number of new features. 
(a) Prior, (b) Prior plus 0.1I(k = 1). (c) Factor A = 50. 



a mass of 0.1 is added at Kd = 1 [see equation (13)], then the sampler reaches 
the equilibrium number of features in around 1500 iterations, as shown in 
Figure 6(b). However, if we try to add features even faster, for example, 
by setting the factor A = 50 in equation (13), then the sampling noise is 
greatly increased, as shown in Figure 6(c), and the computational cost also 
increases significantly because so many spurious features are proposed only 
to be rejected. 

5.3. Biological data: E. Coli time-series dataset. To assess the perfor- 
mance of each algorithm on the biological data where no ground truth is 
available, we calculated the test set log likelihood under the posterior. Ten 
percent of entries from Y were removed at random, ten times, to give ten 
data sets for inference. We do not use mean square error as a measure of 
predictive performance because of the large variation in the signal to noise 
ratio across gene expression level probes. 

The test log likelihood achieved by the various algorithms on the E. Coli 
data set from Kao et al. (2004), including 100 genes at 24 time-points, is 
shown in Figure 7(a). On this simple data set incorporating sparsity doesn't 
improve predictive performance. Overfitting the number of latent factors 
does damage performance, although using the ARD or sparse prior allevi- 
ates the problem. Based on predictive performance of the finite models, five 
is a sensible number of features for this data set: the NSFA model infers 
a median number of 4 features, with some probability of there being 5, as 
shown in Figure 7(b). 

5.4. Breast cancer data set. We assess these algorithms in terms of pre- 
dictive performance on the breast cancer data set of West et al. (2007), 
including 226 genes across 251 individuals. We find that all the finite mod- 
els are sensitive to the choice of the number of factors, K. The samplers were 
found to have converged after around 1000 samples according to standard 
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(a) (b) 

Fig. 7. Results on E. Coli time-series data set from Kao et al. (2004) (N = 24, D = 100, 
3000 MCMC iterations), (a) Log likelihood of test data under each model based on the last 
100 MCMC samples. The boxplots show variation across 10 different random splits of 
the data into training and test sets, (b) Number of active latent features during a typical 
MCMC run of the NSFA model. 



multiple chain convergence measures, so 3000 MCMC iterations were used 
for all models. The predictive log likelihood was calculated using the final 
100 MCMC samples. Figure 8(a) shows test set log likelihoods for 10 ran- 
dom divisions of the data into training and test sets. Factor analysis (FA) 
shows significant overfitting as the number of latent features is increased 
from 20 to 40. Using the ARD prior prevents this overfitting (AFA), giving 
improved performance when using 20 features and only slightly reduced per- 
formance when 40 features are used. The sparse finite model (SFA) shows 
an advantage over AFA in terms of predictive performance as long as un- 
derfitting does not occur: performance is comparable when using only 10 
features. However, the performance of SFA is sensitive to the choice of the 
number of factors, K. The performance of the sparse nonparametric model 
(NSFA) is comparable to the sparse finite model when an appropriate num- 
ber of features is chosen, but avoids the time consuming model selection 
process. Fokoue's method (FOK) was run with K = 20 and various settings 
of the hyperparameter d which controls the overall sparsity of the solution. 
The model's predictive performance depends strongly on the setting of this 
parameter, with results approaching the performance of the sparse models 
(SFA and NSFA) for d= 10 -4 . The performance of BFRM on this data set 
is noticeably worse than the other sparse models. 

We now consider the computation cost of the algorithms. As described in 
Section 4, sampling Z and G takes order O(NKD) operations per iteration, 
and sampling X takes 0(K 2 + K s + ND) . However, for the moderate values 
encountered for data sets 1 and 2, the main computational cost is sampling 
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Fig. 8. Results on breast cancer data set (N = 251, D = 226, 3000 MCMC iterations). 
(a) Predictive performance: log likelihood of test (the 10% missing) data under each model 
based on the last 100 MCMC samples. Higher values indicate better performance. The 
boxplots show variation across 10 different random splits of the data into training and test 
sets, (b) CPU time (in seconds) per iteration, averaged across the 3000 iteration run. (c) 
CPU time (in seconds) per iteration divided by the number of features at that iteration, 
averaged across all iterations. 



the nonzero elements of G, which takes 0((1 — s)DK) where s is the sparsity 
of the model. Figure 8(c) shows the mean CPU time per iteration divided 
by the number of features at that iteration. Naturally, straight FA is the 
fastest, taking only around 0.025s per iteration per feature. The value in- 
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creases slightly with increasing K, suggesting that here the 0(K 2 D + K 3 ) 
calculation and inversion of A, the precision of the conditional on X, must be 
contributing. The computational cost of adding the ARD prior is negligible 
(AFA). The CPU time per iteration is just over double for the sparse finite 
model (SFA), but the cost actually decreases with increasing K, because 
the sparsity of the solution increases to avoid overfitting. There are fewer 
nonzero elements of G to sample per feature, so the CPU time per feature 
decreases. The CPU time per iteration per feature for the nonparametric 
sparse model (NSFA) is somewhat higher than for the finite model because 
of the cost of the feature birth and death process. However, Figure 8(b) 
shows the absolute CPU time per iteration, where we see that the nonpara- 
metric model is only marginally more expensive than the finite model of 
appropriate size K = 15 and cheaper than choosing an unnecessarily large 
finite model (SFA with K = 20, 40). Fokoue's method (FOK) has compara- 
ble computational performance to the sparse finite model, but interestingly 
has increased cost for the optimal setting of d = 10~ 4 . The parameter space 
for FOK is continuous, making search easier but requiring a normal random 
variable for every element of G. BFRM pays a considerable computational 
cost for both the hierarchical sparsity prior and the DP prior on X. SPCA 
was not run on this data set, but results on the synthetic data in Section 5.1 
suggest it is somewhat faster than the sampling methods, but not hugely 
so. The computational cost of SPCA is ND 2 + mO(D 2 K + DK 2 + D 3 ) in 
the N > D case (where m is the number of iterations to convergence) and 
ND 2 + mO(D 2 K + DK 2 ) in the D > N case, taking the limit A ->• oo. In 
either case an individual iteration of SPCA is more expensive than one sam- 
pling iteration of NSFA (since K < D), but fewer iterations will generally be 
required to reach convergence of SPCA than are required to ensure mixing 
of NSFA. 

5.5. Prostate cancer data set. Figure 9 shows the predictive performance 
of AFA, FOK and NSFA on the prostate cancer data set of Yu et al. (2004), 
for ten random splits into training and test data. The boxplots show varia- 
tion from ten random splits into training and test data. The large number 
of genes {D = 12557 across N = 171 individuals) in this data set makes in- 
ference slower, but the problem is manageable since the computational com- 
plexity is linear in the number of genes. Despite the large number of genes, 
the appropriate number of latent factors, in terms of maximizing predictive 
performance, is still small, around 10 (NSFA infers a median of 12 factors). 
This may seem small relative to the number of genes, but it should be noted 
that the genes included in the breast cancer and E. Coli data sets are those 
capturing the most variability. Surprisingly, SFA actually performed slightly 
worse on this data set than AFA. Both are highly sensitive to the number of 
latent factors chosen. NSFA, however, gives better predictive log likelihoods 
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Fig. 9. Test set log likelihoods on Prostate cancer data set from Yu et al. (2004), including 
12557 genes across 111 individuals (1000 MCMC iterations). 



than either finite model for any fixed number of latent factors K. Running 
1000 iterations of NSFA on this data set takes under 8 hours. BFRM and 
FOK were impractically slow to run on this data set. 

6. Discussion. We have seen that in both the E. Coli and breast can- 
cer data sets that sparsity can improve predictive performance, as well as 
providing a more easily interpretable solution. Using the IBP to provide 
sparsity is straightforward, and allows the number of latent factors to be 
inferred within a well-defined theoretical framework. This has several ad- 
vantages over manually choosing the number of latent factors. Choosing too 
few latent factors damages predictive performance, as seen for the breast 
cancer data set. Although choosing too many latent factors can be com- 
pensated for by using appropriate ARD-like priors, we find this is typically 
more computationally expensive than the birth and death process of the 
IBP. Manual model selection is an alternative but is time consuming. Fi- 
nally, we show that running NSFA on full gene expression data sets with 
10000+ genes is feasible so long as the number of latent factors remains rel- 
atively small. An interesting direction for this research is how to incorporate 
prior knowledge, for example, if certain transcription factors are known to 
regulate specific genes. Incorporating this knowledge could both improve the 
performance of the model and improve interpretability by associating latent 
variables with specific transcription factors. Another possibility is incorpo- 
rating correlations in the Indian Buffet Process, which has been proposed 
for simpler models [Doshi-Velez and Ghahramani (2009); Courville, Eck and 
Bengio (2009)]. This would be appropriate in a gene expression setting where 
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multiple transcription factors might be expected to share sets of regulated 
genes due to common motifs. Unfortunately, performing MCMC in all but 
the simplest of these models suffers from slow mixing. 

Acknowledgments. We would like to thank the anonymous reviewers for 
helpful comments. 

SUPPLEMENTARY MATERIAL 

Graphs of precision and recall for the synthetic data experiment. (DOI: 
10.1214/10-AOAS435SUPP; .pdf). The precision and recall of active ele- 
ments of the Z matrix achieved by each algorithm (after thresholding for 
the nonsparse algorithms) on the synthetic data experiment, described in 
Section 5.1. The results are consistent with the reconstruction error. 
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